1. INTRODUCTION

Determining the interconnection weights of a feed-forward multilayered neural network so that the resulting transfer function (input-output map) will map a certain set of inputs to a corresponding set of desired outputs is viewed here as an interpolation problem. The layered net, which has $L \ge 1$ layers of weights, is described in the next section together with a statement of the interpolation problem and some preliminary results.

In Section 3 it is shown how one can interpolate through a set of $m_{L-1} + 1$ (or fewer) input-output points with distinct inputs, where $m_{L-1}$ is the number of neurons in the layer preceding the output layer. This can be accomplished by a proper choice of the last layer of weights. A closed-form expression for these weights is given in terms of the $m_{L-1} + 1$ points of interpolation. These weights are a function of all of the weights in the preceding layers, which may be chosen at random.

Section 4 discusses nets with only two layers of weights ($L = 2$). A method is presented for determining all of the weights of such a net so that its transfer function interpolates through a set of $m_0 + 1$ (or fewer) points, where $m_0$ is the number of neurons in the input layer. The two methods for selecting the weights are compared in Section 5, and suggestions for their applications are given.

The freedom that exists in the selection of the first $L - 1$ layers of weights when using the method of Section 3 can be used to reduce the sensitivity (to noisy input patterns) of the resulting input-output map. The sensitivity of the transfer function at an interpolation point is measured here by the norm of the Jacobian matrix (total derivative) of the transfer function at the given point. Since a small change in the input produces a change in the output whose magnitude is approximately bounded by the product of the norm of the Jacobian matrix and the magnitude of the change in the input, it is suggested that by minimizing the norm of the Jacobian matrix at the interpolation points the change in output produced by a small change in the input can be minimized. Thus, an expression for the Jacobian matrix of the transfer function is derived and is presented in Section 6. Before computing its norm, the induced $(p, q)$ matrix norms are introduced together with some of their properties. A judicious choice for $p$ and $q$ yields computable upper bounds for the norm of the Jacobian matrix. The results suggest that small weights are required for low sensitivity.

For  an  introduction  to  feed-forward  layered  neural  nets  (FLNNs)  and  some 
of  their  basic  properties,  the  reader  is  referred  to  References  1  and  2. 

2.  NOTATION,  PROBLEM  STATEMENT,  AND  PRELIMINARY  RESULTS 

We will consider layered neural nets with architecture $m = (m_0, m_1, \ldots, m_L)$. This means (Reference 2) that the net consists of $L + 1$ layers of neurons with $m_i$ neurons in the $i$th layer ($0 \le i \le L$). The activation function of each neuron will be denoted by $S: R \to (-1, 1)$, where $R$ is the set of real numbers and $S$ is assumed to be a strictly increasing continuous sigmoid mapping $R$ onto the open interval $(-1, 1)$. Thus, $S$ is invertible with inverse $S^{-1}: (-1, 1) \to R$. Euclidean $n$-space with the standard metric topology will be denoted by $R^n$, the set of real $n \times m$ matrices will be denoted by $R^{n \times m}$, and $(-1, 1)^n \equiv \{x = [x_1, x_2, \ldots, x_n]^T \in R^n : |x_i| < 1,\ i = 1, 2, \ldots, n\}$. For each positive integer $n$, let $S_n: R^n \to (-1, 1)^n$ be defined by $S_n(x) = [S(x_1), S(x_2), \ldots, S(x_n)]^T$, for all $x = [x_1, x_2, \ldots, x_n]^T \in R^n$. Similarly, let $S_n^{-1}: (-1, 1)^n \to R^n$ be defined by $S_n^{-1}(x) \equiv [S^{-1}(x_1), S^{-1}(x_2), \ldots, S^{-1}(x_n)]^T$, for all $x = [x_1, x_2, \ldots, x_n]^T \in (-1, 1)^n$. The superscript $T$ denotes transpose.

For each $\beta \in R^n$, let $\hat{\beta}: R^n \to R^n$ denote the operator defined by $\hat{\beta}(x) \equiv x + \beta$ ($x \in R^n$). If $f$ and $g$ are two functions with the range of $g$ contained in the domain of $f$, then $f \circ g$ will denote the composition of $f$ and $g$. The same symbol will be used to represent a linear transformation and its matrix with respect to the standard orthonormal bases on its domain and range.


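If $W_i \in R^{m_i \times m_{i-1}}$ ($1 \le i \le L$) and $\beta_i \in R^{m_i}$ ($0 \le i \le L$), let $T_i = (S_{m_i} \circ \hat{\beta}_i \circ W_i): R^{m_{i-1}} \to (-1, 1)^{m_i}$ ($1 \le i \le L$) and let $T_0: R^{m_0} \to A^{m_0}$ denote either the identity map [in which case, $A = R$], the composition $S_{m_0} \circ \hat{\beta}_0$, or simply $S_{m_0}$ [in which cases, $A = (-1, 1)$]. Note that $T_0$ is injective; thus, one may define its inverse $T_0^{-1}$ as follows:

$$T_0^{-1} = \begin{cases} T_0: R^{m_0} \to R^{m_0}, & \text{if } T_0 = \text{identity map} \\ S_{m_0}^{-1}: (-1, 1)^{m_0} \to R^{m_0}, & \text{if } T_0 = S_{m_0} \\ \widehat{(-\beta_0)} \circ S_{m_0}^{-1}: (-1, 1)^{m_0} \to R^{m_0}, & \text{if } T_0 = S_{m_0} \circ \hat{\beta}_0. \end{cases}$$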


With this notation, the transfer function (input-output map) of the FLNN with architecture $m = (m_0, m_1, \ldots, m_L)$ and activation function $S$ that will be considered here can be written as the composition

$$\mathcal{F} \equiv T_L \circ T_{L-1} \circ \cdots \circ T_1 \circ T_0 : R^{m_0} \to (-1, 1)^{m_L}. \qquad (2.1)$$
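As an illustration of the composition in Equation 2.1, the following minimal sketch evaluates a layered net in plain Python. It assumes $S = \tanh$ (one strictly increasing sigmoid onto $(-1, 1)$) and takes $T_0$ to be the identity map; the function names `affine`, `layer`, and `transfer`, the architecture, and the weight values are hypothetical and chosen only for the example.

```python
import math

def affine(W, beta, x):
    """Return W x + beta, with W a list of rows and beta, x plain lists."""
    return [sum(w * xj for w, xj in zip(row, x)) + b for row, b in zip(W, beta)]

def layer(W, beta, x, S=math.tanh):
    """One map T_i: the affine step followed by S applied to each component."""
    return [S(v) for v in affine(W, beta, x)]

def transfer(weights, x, S=math.tanh):
    """Apply T_1, then T_2, ..., then T_L to x (T_0 taken as the identity)."""
    for W, beta in weights:
        x = layer(W, beta, x, S)
    return x

# Hypothetical architecture m = (2, 3, 1); the numbers are illustrative only.
weights = [
    ([[0.5, -1.0], [1.5, 0.25], [-0.75, 0.5]], [0.1, -0.2, 0.0]),  # [W_1 : beta_1]
    ([[1.0, -0.5, 0.25]], [0.05]),                                  # [W_2 : beta_2]
]
out = transfer(weights, [0.3, -0.7])
print(out)  # one component, strictly inside (-1, 1) because S = tanh
```

Each pair in `weights` plays the role of an augmented matrix $[W_i : \beta_i]$, so the list has one entry per layer of weights.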

Remark 2.1. For the sake of generality, three possibilities have been allowed for $T_0$ in order to accommodate exceptions or variations in the interpretation or in the use of the first layer of neurons, which is considered by some authors as being only an "input layer" ($T_0$ = identity function), or may be used simply to "normalize the input" ($T_0 = S_{m_0}$), or to "normalize and center the input" ($T_0 = S_{m_0} \circ \hat{\beta}_0$). However, if $T_0 = S_{m_0} \circ \hat{\beta}_0$, it will be assumed that $\beta_0 \in R^{m_0}$ is free to be chosen to satisfy some criterion other than the interpolation problem (e.g., to "center" the set of input data) and that once $\beta_0$ has been chosen, it remains fixed. Therefore, the only free parameters are $W_i$ and $\beta_i$, for $i = 1, 2, \ldots, L$.

Interpolation Problem 1 (IP1). Given a set of points of interpolation

$$\Theta \equiv \{(I_i, O_i) \in R^{m_0} \times (-1, 1)^{m_L} : 1 \le i \le k \text{ and } I_i \ne I_j \text{ for } i \ne j\}, \qquad (2.2)$$

determine $W_i$ and $\beta_i$ ($1 \le i \le L$) such that the map $\mathcal{F}$ of Equation 2.1 satisfies

$$\mathcal{F}(I_i) = O_i, \text{ for all } i = 1, 2, \ldots, k. \qquad (2.3)$$

Since the sigmoid $S$ is injective and $\beta_0$ is assumed to be fixed, IP1 can be reduced to an apparently simpler problem that circumvents having to treat the three possible choices for $T_0$ separately. This problem is defined next.

Let $F: R^{m_0} \to R^{m_L}$ denote the map defined by Equation 2.1 with $T_0$ equal to the identity map and $T_L$ replaced by $(\hat{\beta}_L \circ W_L): R^{m_{L-1}} \to R^{m_L}$.

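The Interpolation Problem (IP). Given a set of points of interpolation,

$$\Omega \equiv \{(x_i, y_i) \in R^{m_0} \times R^{m_L} : 1 \le i \le k \text{ and } x_i \ne x_j \text{ for } i \ne j\}, \qquad (2.4)$$

determine $W_i$ and $\beta_i$ ($1 \le i \le L$) such that

$$F(x_i) = y_i, \text{ for all } i = 1, 2, \ldots, k. \qquad (2.5)$$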

The  following  proposition  shows  to  what  extent  the  two  interpolation 
problems  are  equivalent. 

Proposition 2.1. Let the integer $k > 0$ be fixed.

(a) If IP has a solution for every set of interpolation points $\Omega$ as in Equation 2.4, then IP1 has a solution for every set of interpolation points $\Theta$ as in Equation 2.2.

(b) If IP1 has a solution for every set $\Theta$ as in Equation 2.2, then IP has a solution for every set of interpolation points $\Omega$, where $\Omega$ is as in Equation 2.4 if $T_0$ = identity map, and $\Omega = \{(x_i, y_i) \in (-1, 1)^{m_0} \times R^{m_L} : 1 \le i \le k \text{ and } x_i \ne x_j \text{ for } i \ne j\}$ if $T_0 = S_{m_0}$ or $T_0 = S_{m_0} \circ \hat{\beta}_0$.

Proof. The result follows from the definition of $T_0^{-1}$ and the fact that $\mathcal{F}(I) = O \iff F(x) = y$ with $x = T_0(I)$ and $y = S_{m_L}^{-1}(O)$, where $\mathcal{F}$ is the map of Equation 2.1 and $F$ is the map with the linear last layer $\hat{\beta}_L \circ W_L$. ////*

It follows from part (a) of the proposition that it suffices to solve the IP. The proposition also shows how to handle the variations in the use of the first layer. Consequently, for the remainder of the paper it will be assumed, without loss of generality, that the FLNN has transfer function $F: R^{m_0} \to R^{m_L}$ and the FLNN will be referred to as "the net."

Remark 2.2. One reason for requiring $S$ to be injective is for Proposition 2.1 to hold. This increases the generality of the results that follow by allowing several possible forms for $T_0$ (see Remark 2.1) at the expense of restricting the class of permissible activation functions. Another alternative is to assume that the transfer function of the FLNN is given by $F$ (which is the case for the remainder of the paper) and remove the restriction on $S$ of being injective. If this is the case, then the class of permissible activation functions may include sigmoids that saturate, in which case it will be assumed that the range of $S$ is the closed interval $[-1, 1]$. To avoid confusion, we will indicate when a result or definition also holds for noninjective activation functions. Thus, $S$ is assumed to be injective throughout, unless otherwise stated.

In  order  to  facilitate  the  statements  of  some  results,  we  make  the 
following: 

Definition 2.1.

(a) We shall say that $F$ interpolates through $\Omega$ if $\Omega$ is as in Equation 2.4 and Equation 2.5 holds.

(b) If IP has a solution, we shall say that $\Omega$ is realizable by the net.

(c) The largest integer $k$ with the property that every set of points $\Omega$ as in Equation 2.4 is realizable by a net with architecture $m$ will be called the interpolation capacity of the net and will be denoted by $IC(m)$.

* The symbol //// indicates the end of a proof.

A related problem is that of characterizing $IC(m)$ for all $m \in N^{L+1}$. (Here, $N^{L+1}$ denotes the set of $(L + 1)$-tuples of positive integers.)

By a simple dimensionality argument, it was shown (Reference 2) that for any continuous activation function $S: R \to [-1, 1]$ and for all $m \in N^{L+1}$ the interpolation capacity $IC(m)$ is bounded above by $C(m)$, where

$$C(m) \equiv \frac{1}{m_L} \sum_{i=1}^{L} m_i (m_{i-1} + 1) \quad (m \in N^{L+1}). \qquad (2.6)$$

Note that the augmented matrices $[W_i : \beta_i]$ are members of $R^{m_i \times (m_{i-1} + 1)}$, $i = 1, 2, \ldots, L$. Thus, $C(m)$ is simply the number of degrees of freedom (that is, the number of parameters that need to be specified in order to define $F$ uniquely) divided by the number of outputs.

As a corollary to the next proposition, one can obtain a sharper upper bound for $IC(m)$ for all $m \in N^{L+1}$ with $m_L > m_{L-1} + 1$.

Proposition 2.2. If $\Omega$ is realizable, then at most $m_{L-1} + 1$ of the vectors $y_i \in R^{m_L}$ ($1 \le i \le k$) can be linearly independent.

We need the following notation in the proof of this proposition: let $F_1 = T_1$ and $F_n = T_n \circ F_{n-1}$ for $2 \le n \le L$.

Proof. Since $\Omega$ is realizable, there exist $W_i$ and $\beta_i$ ($1 \le i \le L$) such that

$$y_i = F(x_i) = (\hat{\beta}_L \circ W_L \circ F_{L-1})(x_i) = W_L F_{L-1}(x_i) + \beta_L = [W_L : \beta_L] \begin{bmatrix} F_{L-1}(x_i) \\ 1 \end{bmatrix} \quad (1 \le i \le k). \qquad (2.7)$$

Since the set $\left\{ \begin{bmatrix} F_{L-1}(x_i) \\ 1 \end{bmatrix} \in R^{m_{L-1} + 1} : 1 \le i \le k \right\}$ contains at most $m_{L-1} + 1$ linearly independent vectors, Equation 2.7 implies that at most $m_{L-1} + 1$ of the vectors $y_i$ ($1 \le i \le k$) can be linearly independent. ////

Corollary 2.1. If $m_L > m_{L-1} + 1$, $k > m_{L-1} + 1$, and the set $\{y_i \in R^{m_L} : 1 \le i \le k\}$ contains more than $m_{L-1} + 1$ linearly independent vectors, then the set $\{(x_i, y_i) \in R^{m_0} \times R^{m_L} : 1 \le i \le k\}$ is not realizable by a net with architecture $m = (m_0, m_1, \ldots, m_L)$ for any choice of vectors $x_i \in R^{m_0}$ ($1 \le i \le k$).

Corollary 2.2. If $m = (m_0, m_1, \ldots, m_L)$ with $m_L > m_{L-1} + 1$, then $IC(m) \le m_{L-1} + 1$.

Remark 2.3. Proposition 2.2 and Corollaries 2.1 and 2.2 hold for any activation function $S$ if the net has transfer function $F$. Corollary 2.2 also holds for a net with transfer function $\mathcal{F}$ (the map of Equation 2.1) if $S$ is invertible.

The  following  example  shows  that  there  are  architectures  m  for  which 
IC(m)  <  C(m). 

Example 2.1. If $L = 2$, $m_0 = m_1 + 1$, $m_2 = m_1 + 2$, and $m_1 \ge 1$, then $C(m) = 2m_1 + 1$. Since $m_2 > m_1 + 1$, by Corollary 2.2 we have $IC(m) \le m_1 + 1$, which is strictly less than $C(m)$.
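The arithmetic in Example 2.1 can be checked directly from Equation 2.6. The sketch below is only a numerical sanity check; the function names `dof_bound` and `last_layer_bound` are hypothetical, and exact rational arithmetic is used so that no rounding enters.

```python
from fractions import Fraction

def dof_bound(m):
    """C(m) = (1/m_L) * sum_{i=1}^{L} m_i * (m_{i-1} + 1), for m = (m_0, ..., m_L)."""
    total = sum(m[i] * (m[i - 1] + 1) for i in range(1, len(m)))
    return Fraction(total, m[-1])

def last_layer_bound(m):
    """The bound m_{L-1} + 1 of Corollary 2.2, or None when m_L <= m_{L-1} + 1."""
    return m[-2] + 1 if m[-1] > m[-2] + 1 else None

# The family of architectures in Example 2.1: m = (m_1 + 1, m_1, m_1 + 2).
for m1 in range(1, 6):
    m = (m1 + 1, m1, m1 + 2)
    print(m, dof_bound(m), last_layer_bound(m))
```

For every architecture in this family the second printed bound is smaller than the first, which is the point of the example.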

3.  THE  LAST  LAYER  OF  WEIGHTS:  A  LOWER  BOUND  FOR  IC(m) 

In this section, we present a characterization of the last layer of weights $[W_L : \beta_L]$ of a net whose transfer function $F$ interpolates through a set of points $\Omega = \{(x_i, y_i) : 1 \le i \le k\}$. This characterization involves the inverse of the matrix $X_\alpha(p')$ defined below. Conditions under which $X_\alpha(p')$ is invertible are explored. We find that $X_\alpha(p')$ is invertible under very mild conditions, in which case one obtains the following lower bound for $IC(m)$:

$$IC(m) \ge m_{L-1} + 1, \text{ for } m = (m_0, m_1, \ldots, m_L). \qquad (3.1)$$

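Let $m = (m_0, m_1, \ldots, m_L) \in N^{L+1}$,

$$P(m) \equiv \prod_{i=1}^{L} R^{m_i \times (m_{i-1} + 1)} \quad \text{and} \quad P'(m) \equiv \prod_{i=1}^{L-1} R^{m_i \times (m_{i-1} + 1)}.$$

If a net has architecture $m$, then the collection of all transfer functions associated with the net for every possible set of weights is clearly parametrized by $P(m)$. For each point $p \equiv ([W_1 : \beta_1], [W_2 : \beta_2], \ldots, [W_L : \beta_L]) \in P(m)$, let $F_p$ denote the transfer function of the net with weights $p$. Similarly, for each $p' = ([W_1 : \beta_1], [W_2 : \beta_2], \ldots, [W_{L-1} : \beta_{L-1}]) \in P'(m)$, let

$$F_{p'} = (S_{m_{L-1}} \circ \hat{\beta}_{L-1} \circ W_{L-1}) \circ \cdots \circ (S_{m_1} \circ \hat{\beta}_1 \circ W_1).$$

If $\Pi: P(m) \to P'(m)$ denotes the map $([W_1 : \beta_1], [W_2 : \beta_2], \ldots, [W_L : \beta_L]) \mapsto ([W_1 : \beta_1], \ldots, [W_{L-1} : \beta_{L-1}])$, then

$$F_p = (\hat{\beta}_L \circ W_L \circ F_{\Pi(p)}), \text{ for every } p = ([W_1 : \beta_1], \ldots, [W_L : \beta_L]) \in P(m). \qquad (3.2)$$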

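Let $\Omega = \{(x_i, y_i) : 1 \le i \le k\}$ be a set of interpolation points, $\Omega_\alpha = \{(x_{\alpha_i}, y_{\alpha_i}) : 1 \le i \le \eta\}$, where $\alpha = (\alpha_1, \alpha_2, \ldots, \alpha_\eta)$ is a multi-index with $\alpha_i \in \{1, 2, \ldots, k\}$ for $1 \le i \le \eta$. For each $p' \in P'(m)$ and multi-index $\alpha$, define

$$X_\alpha(p') \equiv \begin{bmatrix} F_{p'}(x_{\alpha_1}) & F_{p'}(x_{\alpha_2}) & \cdots & F_{p'}(x_{\alpha_\eta}) \\ 1 & 1 & \cdots & 1 \end{bmatrix} \in R^{(m_{L-1} + 1) \times \eta}, \qquad (3.3)$$

$$Y_\alpha \equiv [y_{\alpha_1} : y_{\alpha_2} : \cdots : y_{\alpha_\eta}] \in R^{m_L \times \eta}.$$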


Proposition 3.1. Let $\eta = m_{L-1} + 1$, $k \ge \eta$, and $\alpha = (\alpha_1, \alpha_2, \ldots, \alpha_\eta)$ with $\alpha_i \in \{1, 2, \ldots, k\}$ for $1 \le i \le \eta$.

(a) If $p \in P(m)$, $X_\alpha(\Pi(p))$ is invertible, and $F_p$ interpolates through $\Omega_\alpha$, then $p = (\Pi(p), [W_L : \beta_L])$ with

$$[W_L : \beta_L] = Y_\alpha [X_\alpha(\Pi(p))]^{-1}. \qquad (3.4)$$

(b) Conversely, if $p' \in P'(m)$, $X_\alpha(p')$ is invertible, Equation 3.4 holds, and $p \equiv (p', [W_L : \beta_L])$, then $F_p$ interpolates through $\Omega_\alpha$.

Proof. It follows from Equation 2.7 that $F_p$ interpolates through $\Omega_\alpha$ if, and only if, $p = (\Pi(p), [W_L : \beta_L])$ and

$$Y_\alpha = [W_L : \beta_L] X_\alpha(\Pi(p)). \qquad (3.5)$$

Thus, if $X_\alpha(\Pi(p))$ is invertible, then Equation 3.5 implies Equation 3.4. This proves (a). Conversely, if Equation 3.4 holds, then $[W_L : \beta_L]$ satisfies Equation 3.5, which proves (b). ////
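Part (b) of Proposition 3.1 can be tried numerically: fix the first $L - 1$ layers, build the matrix of hidden outputs with a row of ones appended, and obtain the last layer from Equation 3.4. The sketch below does this in plain Python under stated assumptions: $S = \tanh$, a hypothetical architecture $m = (1, 2, 1)$, hand-picked (rather than random) first-layer weights, and hypothetical helper names `invert`, `matmul`, `hidden`, `last_layer`, and `net`.

```python
import math

def invert(A):
    """Gauss-Jordan inverse with partial pivoting; raises if A is singular."""
    n = len(A)
    M = [list(map(float, row)) + [float(i == j) for j in range(n)]
         for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        if abs(M[piv][c]) < 1e-12:
            raise ValueError("matrix is singular to working precision")
        M[c], M[piv] = M[piv], M[c]
        d = M[c][c]
        M[c] = [v / d for v in M[c]]
        for r in range(n):
            if r != c:
                f = M[r][c]
                M[r] = [a - f * b for a, b in zip(M[r], M[c])]
    return [row[n:] for row in M]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def hidden(first_layers, x):
    """Output of the first L-1 layers, i.e. F_{p'}(x), with S = tanh."""
    for W, beta in first_layers:
        x = [math.tanh(sum(w * v for w, v in zip(row, x)) + b)
             for row, b in zip(W, beta)]
    return x

def last_layer(first_layers, xs, ys):
    """[W_L : beta_L] = Y X^{-1}, following Equation 3.4."""
    cols = [hidden(first_layers, x) + [1.0] for x in xs]  # columns of X
    X = [list(r) for r in zip(*cols)]
    Y = [list(r) for r in zip(*ys)]
    return matmul(Y, invert(X))

def net(first_layers, WB, x):
    """F_p(x) = [W_L : beta_L] applied to the hidden output with 1 appended."""
    h = hidden(first_layers, x) + [1.0]
    return [sum(w * v for w, v in zip(row, h)) for row in WB]

first = [([[1.0], [2.0]], [0.2, -0.5])]   # hypothetical [W_1 : beta_1]
xs = [[-1.0], [0.0], [1.0]]               # m_{L-1} + 1 = 3 distinct inputs
ys = [[0.5], [-1.0], [2.0]]               # arbitrary targets in R^{m_L}
WB = last_layer(first, xs, ys)
residual = max(abs(net(first, WB, x)[0] - y[0]) for x, y in zip(xs, ys))
print(WB, residual)
```

If the chosen first-layer weights happened to make the matrix singular, `invert` would raise an error; choosing different weights is then the natural remedy, in line with the discussion of the set $E_\alpha$ that follows.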

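Remark 3.1. If $\rho < \eta = m_{L-1} + 1$, $\alpha = (\alpha_1, \alpha_2, \ldots, \alpha_\rho)$, and the columns of $X_\alpha(p')$ are linearly independent, then there exists a matrix $U \in R^{\eta \times (\eta - \rho)}$ such that the augmented matrix $[X_\alpha(p') : U] \in R^{\eta \times \eta}$ is invertible. If the inverse of $[X_\alpha(p') : U]$ is partitioned as $\begin{bmatrix} V_1 \\ V_2 \end{bmatrix}$, with $V_1 \in R^{\rho \times \eta}$ and $V_2 \in R^{(\eta - \rho) \times \eta}$, then $V_1 X_\alpha(p') = I_\rho$, where $I_\rho$ is the $\rho \times \rho$ identity matrix. Therefore, if $[W_L : \beta_L] = Y_\alpha V_1$, then $[W_L : \beta_L]$ satisfies Equation 3.5, so that with $p = (p', Y_\alpha V_1)$, $F_p$ interpolates through $\Omega_\alpha$.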

For each multi-index $\alpha = (\alpha_1, \alpha_2, \ldots, \alpha_\eta)$, with $\eta = m_{L-1} + 1$, let $E_\alpha$ denote the set of all points $p'$ in $P'(m)$ such that the matrix $X_\alpha(p')$ is invertible. We may define a map $\Gamma_\alpha: E_\alpha \to P(m)$ by $\Gamma_\alpha(p') = (p', Y_\alpha [X_\alpha(p')]^{-1})$. Part (b) of Proposition 3.1 says that all the transfer functions $F_{\Gamma_\alpha(p')}$ ($p' \in E_\alpha$) interpolate through $\Omega_\alpha$; that is, they satisfy

$$F_{\Gamma_\alpha(p')}(x) = y \text{ for all } (x, y) \in \Omega_\alpha. \qquad (3.6)$$

Consequently, if $E_\alpha$ is not empty, then there exists $p \in P(m)$ such that $F_p$ interpolates through $\Omega_\alpha$.

For a given multi-index $\alpha$, the question of whether or not the set $E_\alpha$ is empty is difficult to answer in general, since it depends on the type of activation function in the net. The next proposition shows that under certain conditions the set $E_\alpha$ is "large" (see Remark 3.2). In order to state the proposition, we must introduce the map $\Delta_\alpha: P'(m) \to R$ defined by $\Delta_\alpha(p) \equiv \det X_\alpha(p)$, where $X_\alpha(p)$ is given by Equation 3.3, $\eta = m_{L-1} + 1$, and $\det X_\alpha(p)$ denotes the determinant of $X_\alpha(p)$. Let $\nabla\Delta_\alpha: P'(m) \to R^{\ell}$ denote the gradient of $\Delta_\alpha$, where $\ell = \sum_{i=1}^{L-1} m_i (m_{i-1} + 1)$. A point $p$ in $P'(m)$ has $\ell$ coordinates and will be denoted by $p = (p_1, p_2, \ldots, p_\ell)$; that is, we will identify $P'(m)$ with $R^{\ell}$.

If $E_\alpha^c$ denotes the complement of $E_\alpha$ in $P'(m)$, then $E_\alpha^c = \Delta_\alpha^{-1}(0)$. Hence, $E_\alpha^c$ is closed. Let $E_\alpha^0$ be the set of points $p$ such that $\nabla\Delta_\alpha(p) = 0$.

Proposition 3.3. If the activation function $S: R \to (-1, 1)$ is continuously differentiable, then for each multi-index $\alpha$ with $\eta$ components the set $E_\alpha^c - E_\alpha^0$ has Lebesgue measure zero.

The proof of this proposition is based on the following.

Lemma 3.1. If $S$ is continuously differentiable, then for each $p \in E_\alpha^c - E_\alpha^0$ there exists an open set $O$ in $P'(m)$ such that $p \in O$ and $m(E_\alpha^c \cap O) = 0$, where $m$ denotes the Lebesgue measure on $P'(m)$.

Proof: If $p^0 = (p_1^0, p_2^0, \ldots, p_\ell^0) \in E_\alpha^c$ and $\nabla\Delta_\alpha(p^0) \ne 0$, then $\frac{\partial \Delta_\alpha}{\partial p_j}(p^0) \ne 0$ for some $j \in \{1, 2, \ldots, \ell\}$. To simplify the notation, we may assume that $j = 1$. Write $p^0 = (p_1^0, q^0)$ with $q^0 \equiv (p_2^0, p_3^0, \ldots, p_\ell^0)$. Since the activation function $S$ is continuously differentiable, so is $\Delta_\alpha$; therefore, by the Implicit Function Theorem (Reference 3), there exist open sets $V \subset R^1$ and $U \subset R^{\ell - 1}$ with $(p_1^0, q^0) \in V \times U$ and a unique map $\varphi: U \to V$ such that $\varphi(q^0) = p_1^0$, $\Delta_\alpha(\varphi(q), q) = 0$ for all $q \in U$, and $\Delta_\alpha(p_1, q) \ne 0$ if $(p_1, q) \in V \times U$ and $p_1 \ne \varphi(q)$. Thus,

$$E_\alpha^c \cap [V \times U] = \{(\varphi(q), q) : q \in U\}. \qquad (3.7)$$

Next we will show that $m(E_\alpha^c \cap [V \times U]) = 0$.

Let $m_j$ denote the Lebesgue measure on $R^j$ ($j = 1, 2, \ldots$). Since $E_\alpha^c$ is closed, $Q \equiv E_\alpha^c \cap [V \times U]$ is a Borel set in $R^{\ell}$. Therefore, by the definition of the product measure $m_1 \times m_{\ell - 1}$ (Reference 4) and Theorem 7.11 of Reference 4, we have

$$m_\ell(Q) = (m_1 \times m_{\ell - 1})(Q) = \int_U m_1(\{p_1 : (p_1, q) \in Q\}) \, dm_{\ell - 1}(q).$$

This together with Equation 3.7 gives

$$m_\ell(Q) = \int_U m_1(\{\varphi(q)\}) \, dm_{\ell - 1}(q) = \int_U 0 \, dm_{\ell - 1}(q) = 0.$$

By letting $O = V \times U$, the proof is finished. ////

Proof of Proposition 3.3. By Lemma 3.1, for every $p \in E_\alpha^c - E_\alpha^0$, there exists an open set $O_p$ such that $p \in O_p$ and $m(E_\alpha^c \cap O_p) = 0$. Since every Euclidean space is second countable (Reference 5), $E_\alpha^c - E_\alpha^0$ can be covered by a countable collection of the sets $O_p$. Finally, since Lebesgue measure is countably additive, we conclude that $m(E_\alpha^c - E_\alpha^0) = 0$. ////

Remark 3.2. If also $m(E_\alpha^c \cap E_\alpha^0) = 0$, then it follows from Proposition 3.3 that all the matrices $X_\alpha(p)$ ($p \in P'(m)$) are invertible except for those $p$ in the set $E_\alpha^c$ of measure zero.

As a corollary to Lemma 3.1, we can obtain a weaker condition for the existence of an invertible $X_\alpha(p)$.

Corollary 3.1. If $\nabla\Delta_\alpha$ is not identically zero, then there exists $p \in P'(m)$ such that $X_\alpha(p)$ is invertible; that is, $E_\alpha \ne \emptyset$.

Proof. If $\nabla\Delta_\alpha$ is not identically zero, there is a point $p \in P'(m)$ such that $\nabla\Delta_\alpha(p) \ne 0$. Either $\Delta_\alpha(p) = 0$ or not. If $\Delta_\alpha(p) \ne 0$, then $X_\alpha(p)$ is invertible. If $\Delta_\alpha(p) = 0$, then $p \in E_\alpha^c - E_\alpha^0$ and by Lemma 3.1 there is an open set $O$ in $P'(m)$ such that $p \in O$ and $m(E_\alpha^c \cap O) = 0$. The set $O$ must have positive measure, since it is open and nonempty. Consequently, $O \cap E_\alpha$ is not empty (otherwise $O \subset E_\alpha^c$, a contradiction) and any of its members satisfies the conclusion of the corollary. ////

Corollary 3.2. If $\nabla\Delta_\alpha$ is not identically zero, then there exists $p \in P'(m)$ such that $F_{\Gamma_\alpha(p)}$ interpolates through $\Omega_\alpha$.

Proof. This follows from Corollary 3.1 and Equation 3.6. ////


4.  NETS  WITH  TWO  LAYERS  OF  WEIGHTS 


In this section, we will show how to define the weights of a net with two layers of weights so that the transfer function of the net interpolates through a realizable set $\Omega = \{(x_i, y_i) : 1 \le i \le k\}$ when the matrix

$$X \equiv \begin{bmatrix} x_1 & x_2 & \cdots & x_k \\ 1 & 1 & \cdots & 1 \end{bmatrix}$$

has rank $k$, where $k \le m_0 + 1$.

Assume that $L = 2$, so that $m = (m_0, m_1, m_2)$, and assume $m_2 \le m_1$.

Choose a maximal subset of $\{y_i : 1 \le i \le k\}$ consisting of linearly independent vectors and let $\eta$ be its cardinality; $\eta \le m_2$. Without loss of generality, we may assume that $y_1, y_2, \ldots, y_\eta$ are linearly independent (the vectors $y_i$ may be relabeled if necessary). There exist constants $a_{ij} \in R$ such that

$$y_i = \sum_{j=1}^{\eta} a_{ij} y_j, \quad \eta + 1 \le i \le k. \qquad (4.1)$$

Set $a \equiv \max\{|a_{ij}| : 1 \le j \le \eta,\ \eta + 1 \le i \le k\}$.




NWC  TP  7094 


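If the matrix $X$ defined above has rank $k$, then there exists a matrix $V \in R^{k \times (m_0 + 1)}$ such that $VX = I_k$, where $I_k$ is the $k \times k$ identity matrix (see the argument in Remark 3.1). If $k = m_0 + 1$, then $V = X^{-1}$, the inverse of $X$. Let $e_i^{m_1}$ denote the $i$th column of the $m_1 \times m_1$ identity matrix ($1 \le i \le m_1$). Fix a number $\varepsilon \in (0, 1)$ such that $\varepsilon a < 1$ and consider the following vectors $z_i \in R^{m_1}$, $1 \le i \le k$:

$$z_i \equiv S_{m_1}^{-1}(\varepsilon e_i^{m_1}) = S^{-1}(\varepsilon)\, e_i^{m_1}, \quad \text{for } 1 \le i \le \eta, \qquad (4.2a)$$

$$z_i \equiv S_{m_1}^{-1}\Big(\sum_{j=1}^{\eta} \varepsilon a_{ij} e_j^{m_1}\Big) = \sum_{j=1}^{\eta} S^{-1}(\varepsilon a_{ij})\, e_j^{m_1}, \quad \text{for } \eta + 1 \le i \le k. \qquad (4.2b)$$

Define an $m_1 \times k$ matrix $Z$ by $Z \equiv [z_1 : z_2 : \cdots : z_k]$, and set

$$[W_1 : \beta_1] \equiv ZV, \qquad (4.3)$$

$$W_2 \equiv \frac{1}{\varepsilon} [y_1 : y_2 : \cdots : y_\eta : 0 : \cdots : 0] \in R^{m_2 \times m_1}, \text{ and } \beta_2 \equiv 0. \qquad (4.4)$$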


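Proposition 4.1. Assume $X$ has rank $k$. Using the notation of Section 3 with $L = 2$, if $p = ([W_1 : \beta_1], [W_2 : \beta_2])$ and Equations 4.3 and 4.4 hold, then $F_p$ interpolates through $\Omega$.

Proof: Since $VX = I_k$, it follows that $V \begin{bmatrix} x_i \\ 1 \end{bmatrix} = e_i^k$, where $e_i^k$ is the $i$th column of the $k \times k$ identity matrix, $1 \le i \le k$. Thus, by Equation 4.3,

$$[W_1 : \beta_1] \begin{bmatrix} x_i \\ 1 \end{bmatrix} = ZV \begin{bmatrix} x_i \\ 1 \end{bmatrix} = Z e_i^k = z_i, \quad \text{for } 1 \le i \le k.$$

Consequently, by Equations 4.2a and 4.4, for $1 \le i \le \eta$, we have

$$F_p(x_i) = [W_2 : \beta_2] \begin{bmatrix} S_{m_1}(z_i) \\ 1 \end{bmatrix} = W_2 (\varepsilon e_i^{m_1}) = y_i.$$

Now, by Equations 4.1, 4.2b, and 4.4, for $\eta + 1 \le i \le k$, we have

$$F_p(x_i) = [W_2 : \beta_2] \begin{bmatrix} S_{m_1}(z_i) \\ 1 \end{bmatrix} = W_2 \Big(\sum_{j=1}^{\eta} \varepsilon a_{ij} e_j^{m_1}\Big) = \sum_{j=1}^{\eta} a_{ij} y_j = y_i.$$

Hence, $F_p$ interpolates through $\Omega$.

////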
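The construction of Equations 4.3 and 4.4 can be exercised on a small case. The sketch below covers only the special case in which all the target vectors are linearly independent ($\eta = k$, so Equation 4.2b is not needed) and $k = m_0 + 1$ (so $V = X^{-1}$). It assumes $S = \tanh$, for which $S^{-1}(0) = 0$, a hypothetical architecture $m = (1, 3, 2)$, and $\varepsilon = 0.5$; all function names are illustrative.

```python
import math

def invert(A):
    """Gauss-Jordan inverse with partial pivoting; raises if A is singular."""
    n = len(A)
    M = [list(map(float, row)) + [float(i == j) for j in range(n)]
         for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        if abs(M[piv][c]) < 1e-12:
            raise ValueError("matrix is singular to working precision")
        M[c], M[piv] = M[piv], M[c]
        d = M[c][c]
        M[c] = [v / d for v in M[c]]
        for r in range(n):
            if r != c:
                f = M[r][c]
                M[r] = [a - f * b for a, b in zip(M[r], M[c])]
    return [row[n:] for row in M]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def two_layer_weights(xs, ys, m1, eps=0.5):
    """Equations 4.3-4.4 for the special case eta = k = m_0 + 1 and S = tanh."""
    k = len(xs)
    X = [list(r) for r in zip(*[x + [1.0] for x in xs])]   # (m_0 + 1) x k
    V = invert(X)                                           # V X = I_k
    s = math.atanh(eps)                                     # S^{-1}(eps)
    Z = [[s if i == j else 0.0 for j in range(k)] for i in range(m1)]  # m_1 x k
    W1B1 = matmul(Z, V)                                     # [W_1 : beta_1]
    W2 = [[row[j] / eps if j < k else 0.0 for j in range(m1)]
          for row in zip(*ys)]                              # m_2 x m_1, zero-padded
    return W1B1, W2

def net(W1B1, W2, x):
    """F_p(x) with beta_2 = 0."""
    h = [math.tanh(sum(w * v for w, v in zip(row, x + [1.0]))) for row in W1B1]
    return [sum(w * v for w, v in zip(row, h)) for row in W2]

xs = [[0.0], [1.0]]                 # k = m_0 + 1 = 2 distinct inputs
ys = [[1.0, 0.0], [1.0, 2.0]]       # linearly independent targets in R^2
W1B1, W2 = two_layer_weights(xs, ys, m1=3)
for x, y in zip(xs, ys):
    print(x, net(W1B1, W2, x), y)
```

Handling linearly dependent targets would additionally require the coefficients $a_{ij}$ of Equation 4.1, which this sketch does not compute.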


5.  COMPARISONS  AND  APPLICATIONS 

Two different techniques for determining the weights of a FLNN so that its transfer function interpolates through a given set of points $\Omega$ have been presented. If the cardinality of $\Omega$ is $k$, then both techniques require a certain matrix to have rank $k$ in order to interpolate through all of $\Omega$.

The technique presented in Section 3 (Technique 1) requires the columns of $X_{\mathbf{k}}(p')$ to be linearly independent ($\mathbf{k} = (1, 2, \ldots, k)$; see Remark 3.1). Since the columns of $X_{\mathbf{k}}(p')$ are $(m_{L-1} + 1)$-dimensional, $k$ can be at most $m_{L-1} + 1$. Thus, using Technique 1, the net can interpolate through an arbitrary set $\Omega$ as in Equation 2.4 with at most $m_{L-1} + 1$ points, provided $\nabla\Delta_{\mathbf{k}}$ is not identically zero (see Corollary 3.1).

The technique presented in Section 4 (Technique 2) was developed for nets with $m = (m_0, m_1, m_2)$ and $m_2 \le m_1$. It requires the columns of $X$ to be linearly independent. Since the columns of $X$ are $(m_0 + 1)$-dimensional, $k$ can be at most $m_0 + 1$. Thus, using Technique 2, the net can interpolate through a set $\Omega$ as in Equation 2.4 with at most $m_0 + 1$ points, provided the matrix $X$ has full rank.

Aside from the difference in the number of points through which the net can interpolate using the two techniques, there is another difference. Technique 2 will specify all the weights in the net, while Technique 1 only specifies the last layer of weights in terms of the first $L - 1$ layers of weights. When the activation function $S$ is such that the gradient of $\Delta_{\mathbf{k}}$ vanishes only on a set of measure zero, then as Remark 3.2 suggests, one may choose the first $L - 1$ layers of weights at random and, loosely speaking, with probability 1 a nonsingular matrix $X_{\mathbf{k}}(p')$ is obtained when $k = m_{L-1} + 1$. More precisely, $X_{\mathbf{k}}(p')$ will be nonsingular for all $p'$ except for some points $p'$ in a set of Lebesgue measure zero. It can be shown that any activation function $S$, which becomes an entire function (Reference 6) when extended to the complex plane, has the property that $\nabla\Delta_{\mathbf{k}}$ is zero only on a set of measure zero unless it vanishes identically, in which case $\Delta_{\mathbf{k}}$ itself is identically zero. An example of such an $S$ is $S(t) = 2f(t) - 1$ ($t \in R$), where

$$f(t) \equiv \frac{1}{\sqrt{\pi}} \int_{-\infty}^{t} e^{-x^2} \, dx \quad (t \in R).$$
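For reference, if $f$ is normalized by $1/\sqrt{\pi}$ (the normalization that makes $S$ map $R$ onto $(-1, 1)$; this factor is an assumption here), then $S(t) = 2f(t) - 1$ coincides with the error function. The following rough numerical check compares a trapezoidal estimate of the integral with `math.erf` from the Python standard library; the lower cutoff of $-8$ and the step count are arbitrary choices for the sketch.

```python
import math

def f(t, n=20000, lo=-8.0):
    """Trapezoidal estimate of (1/sqrt(pi)) * integral of exp(-x^2) over [lo, t].

    The tail below lo is dropped; it is negligible at double precision.
    """
    h = (t - lo) / n
    total = 0.5 * (math.exp(-lo * lo) + math.exp(-t * t))
    total += sum(math.exp(-(lo + i * h) ** 2) for i in range(1, n))
    return total * h / math.sqrt(math.pi)

def S(t):
    """Candidate activation S(t) = 2 f(t) - 1."""
    return 2.0 * f(t) - 1.0

for t in (-2.0, -0.5, 0.0, 1.0, 3.0):
    print(t, S(t), math.erf(t))   # the two columns should agree closely
```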

In a similar vein, when $k = m_0 + 1$, the matrix $X$ will be nonsingular with probability 1. That is, $X$ is of the form $\begin{bmatrix} A \\ 1 \ \cdots \ 1 \end{bmatrix}$ with $A \in R^{m_0 \times k}$, and the set of matrices $A$ in $R^{m_0 \times k}$ for which $X$ is singular has Lebesgue measure zero.

For nets with only one hidden layer ($L = 2$) and $m_2 \le m_1$, one has a choice of interpolation techniques. If the goal is to interpolate through as many points as possible, clearly one uses the technique with the largest permissible $k$. However, usually one can select $m_1$; thus, by choosing $m_1$ large enough, one can interpolate through any number of points using Technique 1. On the other hand, in some applications the input layer may be large enough already, in which case Technique 2 may be adequate with a more conservative value for $m_1$.

Perhaps these techniques for determining the weights will prove to be most useful in the initialization of weights. Some of the most popular learning algorithms in use today (e.g., back propagation (Reference 1)) are based on iterative steepest descent minimization procedures, where, at each step, the approximate solution is corrected in the direction of steepest descent in order to reduce the error. To begin, an initial set of weights is required, which is usually chosen at random. The speed of convergence depends very heavily on the quality of the initial weights; that is, on how close the initial weights are to the correct set of weights. To improve the quality of the initial weights, one could select $m_0 + 1$ representative input-output pairs (prototypes) such that the matrix $X$ is invertible and calculate an initial set of weights using Technique 2. Alternatively, one could select $m_{L-1} + 1$ representative input-output pairs, choose the first $L - 1$ layers of weights at random, and calculate the last set of weights using Technique 1.

6.  JACOBIAN  MATRIX  OF  THE  TRANSFER  FUNCTION 

Since  a  small  input  change  to  the  net  produces  an  output  change  whose 
magnitude  is  approximately  bounded  by  the  product  of  the  operator  norm  of 
the  Jacobian  matrix  (total  derivative)  of  the  transfer  function  and  the 
magnitude  of  the  input  change,  we  suggest  that  the  sensitivity  of  the  transfer 
function  F  to  noisy  input  patterns  at  the  interpolation  points  can  be  measured 
by  the  norm  of  the  Jacobian  matrix  of  F  at  the  interpolation  points.  Thus,  in 
this  section,  we  derive  an  expression  for  the  Jacobian  matrix  of  F  and  compute 
an  upper  bound  for  its  norm. 

Let $w_{ij}^{l}$ ($1 \le i \le m_l$, $1 \le j \le m_{l-1}$) denote the entries of the weight matrix $W_l$ ($1 \le l \le L$), and let $\beta_j^{l}$ ($1 \le j \le m_l$) denote the components of the vector $\beta_l$ ($1 \le l \le L$).

For each $z \in R^{m_{l-1}}$ with components $z_i$ ($1 \le i \le m_{l-1}$), let $D_l(z)$ be an $m_l \times m_l$ diagonal matrix defined by

$$D_l(z) \equiv \mathrm{diag}\Big( S'\Big(\sum_{j=1}^{m_{l-1}} w_{ij}^{l} z_j + \beta_i^{l}\Big) : 1 \le i \le m_l \Big),$$

where $S'$ denotes the derivative of $S$. The expression above defines a map $D_l: R^{m_{l-1}} \to R^{m_l \times m_l}$ ($1 \le l \le L$). Recall that $F = (\hat{\beta}_L \circ W_L) \circ T_{L-1} \circ \cdots \circ T_1$ and $F_n = T_n \circ F_{n-1}$ for $1 \le n < L$ with $F_0$ the identity map on $R^{m_0}$. Let $T_l'(z)$ denote the Jacobian matrix of $T_l$ at $z \in R^{m_{l-1}}$ and $[T_l'(z)]_{ij}$ its $ij$th entry ($1 \le i \le m_l$, $1 \le j \le m_{l-1}$, $1 \le l \le L$). Use a similar notation for the Jacobian matrices of $F$ and $F_n$ ($1 \le n < L$).

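Proposition 6.1. Assume $L \ge 2$.

$$T_l'(z) = D_l(z) W_l \quad (z \in R^{m_{l-1}},\ 1 \le l \le L). \qquad (6.1)$$

$$F'(x) = W_L \prod_{l=L-1}^{1} [D_l(F_{l-1}(x)) W_l] \quad (x \in R^{m_0}). \qquad (6.2)$$

Proof: If the components of $z \in R^{m_{l-1}}$ are $z_i$ ($1 \le i \le m_{l-1}$), then

$$[T_l'(z)]_{ij} = \frac{\partial}{\partial z_j} S\Big(\sum_{n=1}^{m_{l-1}} w_{in}^{l} z_n + \beta_i^{l}\Big) = S'\Big(\sum_{n=1}^{m_{l-1}} w_{in}^{l} z_n + \beta_i^{l}\Big) w_{ij}^{l}, \quad 1 \le j \le m_{l-1},\ 1 \le i \le m_l,\ 1 \le l \le L.$$

This gives Equation 6.1.

Applying the chain rule to the composition $T_n \circ F_{n-1}$ and using Equation 6.1, one obtains for $n = 1, 2, \ldots, L - 1$

$$F_n'(x) = T_n'(F_{n-1}(x)) F_{n-1}'(x) = D_n(F_{n-1}(x)) W_n F_{n-1}'(x) \quad (x \in R^{m_0}).$$

Hence, by induction we have

$$F_{L-1}'(x) = \prod_{l=L-1}^{1} [D_l(F_{l-1}(x)) W_l] \quad (x \in R^{m_0}). \qquad (6.3)$$

Finally, since $F(x) = W_L F_{L-1}(x) + \beta_L$, we have $F'(x) = W_L F_{L-1}'(x)$; thus Equation 6.3 implies Equation 6.2. ////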

It  should  be  noted  that  the  product  indicated  in  Equations  6.2  and  6.3  is  a 
product  of  matrices  that  may  not  commute;  thus,  it  is  important  to  understand 
the  correct  order  of  multiplication;  namely, 

\[ [D_{L-1}(F_{L-2}(x)) W_{L-1}] \cdot [D_{L-2}(F_{L-3}(x)) W_{L-2}] \cdots [D_1(x) W_1]. \]
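The product formula of Proposition 6.1 can be checked numerically. The following is a minimal sketch, not part of the original report: it assumes S = tanh, a small net with L = 3 whose weights and biases are arbitrary illustrative values, and hypothetical helper names (`forward`, `jacobian`, `fd_jacobian`). It builds F'(x) by left-multiplying the factors D_l W_l in the order l = 1, 2, ..., L-1 and compares the result with a central finite-difference estimate.

```python
import math

def matvec(W, z):
    return [sum(w * zj for w, zj in zip(row, z)) for row in W]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def forward(weights, biases, x):
    """Return F(x) and the pre-activations of the L-1 hidden layers (S = tanh)."""
    z, pre = list(x), []
    for W, b in zip(weights[:-1], biases[:-1]):
        u = [s + bi for s, bi in zip(matvec(W, z), b)]
        pre.append(u)
        z = [math.tanh(t) for t in u]
    out = [s + bi for s, bi in zip(matvec(weights[-1], z), biases[-1])]
    return out, pre

def jacobian(weights, biases, x):
    """F'(x) = W_L [D_{L-1} W_{L-1}] ... [D_1 W_1]; requires L >= 2."""
    _, pre = forward(weights, biases, x)
    J = None
    for W, u in zip(weights[:-1], pre):
        # D_l W_l: row i of W_l scaled by S'(u_i) = 1 - tanh(u_i)^2
        DW = [[(1.0 - math.tanh(ui) ** 2) * w for w in row]
              for row, ui in zip(W, u)]
        J = DW if J is None else matmul(DW, J)  # new factor goes on the left
    return matmul(weights[-1], J)

def fd_jacobian(weights, biases, x, h=1e-6):
    """Central finite-difference estimate of F'(x)."""
    cols = []
    for j in range(len(x)):
        xp, xm = list(x), list(x)
        xp[j] += h
        xm[j] -= h
        fp, _ = forward(weights, biases, xp)
        fm, _ = forward(weights, biases, xm)
        cols.append([(a - b) / (2 * h) for a, b in zip(fp, fm)])
    return [[cols[j][i] for j in range(len(x))] for i in range(len(cols[0]))]

# Illustrative net: m0 = 2, m1 = 3, m2 = 2, m3 = 1 (values are arbitrary).
weights = [
    [[0.5, -0.3], [0.1, 0.8], [-0.7, 0.2]],
    [[0.4, -0.6, 0.3], [0.9, 0.1, -0.2]],
    [[1.0, -1.5]],
]
biases = [[0.1, -0.2, 0.05], [0.0, 0.3], [0.2]]
x = [0.3, -0.4]

J = jacobian(weights, biases, x)
J_fd = fd_jacobian(weights, biases, x)
max_err = max(abs(a - b) for ra, rb in zip(J, J_fd) for a, b in zip(ra, rb))
print(J, max_err)
```

Reversing the order of the factors in `jacobian` would in general give a matrix of the wrong shape or the wrong value, which is the point of the remark on non-commuting products.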

The sensitivity of the transfer function F at a point x will be measured
here by the induced (p, q)-norm of the linear transformation F'(x) for
particular values of p and q. Other operator norms could be used; however,
the induced (p, q)-norms lead to upper bounds for the norm of F'(x) that are
computable for certain values of p and q and can be interpreted qualitatively.

The induced (p, q) matrix norms are defined next. A list of properties of
(p, q)-norms that will be needed is also included. To the author's knowledge,
some of the properties of (p, q)-norms needed here are not available in the
open literature. For completeness, this material is developed in Proposition
6.2.

For $1 \le p < \infty$, the p-norm of a vector $x \in R^n$ is defined by

\[ \|x\|_p = \Big( \sum_{i=1}^{n} |x_i|^p \Big)^{1/p}, \]

where $x_i$ ($1 \le i \le n$) are the components of x with respect to the standard basis
on $R^n$ consisting of the columns of the $n \times n$ identity matrix $I_n$. If $p = \infty$, then
$\|x\|_\infty = \max_{1 \le i \le n} |x_i|$. If $A : R^n \to R^m$ is a linear transformation, then the
induced (p, q)-norm of A is defined by

\[ \|A\|_{pq} = \sup_{x \ne 0} \frac{\|Ax\|_p}{\|x\|_q}, \]

where $1 \le p \le \infty$ and $1 \le q \le \infty$. Recall that we are using the same symbol to
represent a linear transformation and the matrix associated with it with respect
to the standard basis on its domain and range. If A is an $m \times n$ matrix, let $A_{ij}$
denote its $ij$th entry ($1 \le i \le m$, $1 \le j \le n$).

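Proposition 6.2. Let $A : R^n \to R^m$, $B : R^m \to R^k$, and $C : R^l \to R^m$.

\[ \|BA\|_{pq} \le \|B\|_{pr} \|A\|_{rq}, \qquad 1 \le p, q, r \le \infty. \tag{6.4a} \]

\[ \|A\|_{pq} \le \|[A : C]\|_{pq}, \qquad 1 \le p, q \le \infty. \tag{6.4b} \]

\[ \|A\|_{\infty 1} = \max_{i,j} |A_{ij}|. \tag{6.5} \]

If A is a square diagonal matrix, $A = \mathrm{diag}\{A_{ii}\}$, then

\[ \|A\|_{p\infty} = \Big( \sum_{i=1}^{n} |A_{ii}|^p \Big)^{1/p}, \qquad 1 \le p < \infty. \tag{6.6} \]

\[ \|A\|_{p1} = \max_{1 \le j \le n} \Big( \sum_{i=1}^{m} |A_{ij}|^p \Big)^{1/p}, \qquad 1 \le p < \infty. \tag{6.7} \]

\[ \|A\|_{\infty\infty} = \max_{1 \le i \le m} \sum_{j=1}^{n} |A_{ij}|. \tag{6.8} \]

\[ \|A\|_{\infty p} = \max_{1 \le i \le m} \Big( \sum_{j=1}^{n} |A_{ij}|^{p^*} \Big)^{1/p^*}, \qquad 1 < p < \infty, \tag{6.9} \]

where $p^*$ is the conjugate exponent to p; that is, $p^*$ satisfies $\frac{1}{p} + \frac{1}{p^*} = 1$.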


Proof: The definition of $\|B\|_{pr}$ implies $\|By\|_p \le \|B\|_{pr} \|y\|_r$ for all $y \in R^m$.
Therefore,

\[ \|BA\|_{pq} = \sup_{x \ne 0} \frac{\|BAx\|_p}{\|x\|_q} \le \sup_{x \ne 0} \frac{\|B\|_{pr} \|Ax\|_r}{\|x\|_q} = \|B\|_{pr} \|A\|_{rq}. \]

This gives Inequality 6.4a.

There exists $x \in R^n$ such that $\|x\|_q = 1$ and $\|A\|_{pq} = \|Ax\|_p$. Consider the vector
$y = \begin{bmatrix} x \\ 0 \end{bmatrix} \in R^{n+l}$. Clearly, $\|y\|_q = \|x\|_q = 1$ and
$\|[A : C] y\|_p = \|Ax\|_p = \|A\|_{pq}$. Consequently, Inequality 6.4b holds.

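If $\|x\|_1 = 1$, then

\[ \|Ax\|_\infty = \max_{1 \le i \le m} \Big| \sum_{j=1}^{n} A_{ij} x_j \Big| \le \max_{1 \le i \le m} \sum_{j=1}^{n} |A_{ij}| |x_j| \le \max_{i,j} |A_{ij}| \sum_{j=1}^{n} |x_j| = \max_{i,j} |A_{ij}|. \]

Hence, the left-hand side (LHS) of Equation 6.5 cannot exceed the right-hand
side (RHS). If $|A_{i_0 j_0}| = \max_{i,j} |A_{ij}|$ and x is the $j_0$th column of $I_n$, then
$\|x\|_1 = 1$ and

\[ \|Ax\|_\infty = \max_{1 \le i \le m} \Big| \sum_{j=1}^{n} A_{ij} x_j \Big| = \max_{1 \le i \le m} |A_{i j_0}| = |A_{i_0 j_0}|. \]

This establishes Equation 6.5.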

If A is a diagonal (square) matrix and $\|x\|_\infty = 1$, then

\[ \|Ax\|_p = \Big( \sum_{i=1}^{n} |A_{ii} x_i|^p \Big)^{1/p} = \Big( \sum_{i=1}^{n} |A_{ii}|^p |x_i|^p \Big)^{1/p} \le \Big( \sum_{i=1}^{n} |A_{ii}|^p \Big)^{1/p}. \]

Consequently, the LHS of Equation 6.6 cannot exceed the RHS. Moreover, if
$x = [1, 1, \ldots, 1]^T \in R^n$, then $\|x\|_\infty = 1$ and
$\|Ax\|_p = \big( \sum_{i=1}^{n} |A_{ii}|^p \big)^{1/p}$. This establishes Equation 6.6.


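Let $1 \le p < \infty$, $\|x\|_1 = 1$, and consider the m-vectors $v_j = x_j a_j$, where $a_j$ denotes
the $j$th column of A ($1 \le j \le n$). Since the vector norm $\|\cdot\|_p$ satisfies the triangle
inequality, it follows that $\big\| \sum_{j=1}^{n} v_j \big\|_p \le \sum_{j=1}^{n} \|v_j\|_p$. Consequently,

\[ \|Ax\|_p = \Big( \sum_{i=1}^{m} \Big| \sum_{j=1}^{n} A_{ij} x_j \Big|^p \Big)^{1/p} = \Big\| \sum_{j=1}^{n} v_j \Big\|_p \le \sum_{j=1}^{n} \|v_j\|_p = \sum_{j=1}^{n} \Big( \sum_{i=1}^{m} |A_{ij} x_j|^p \Big)^{1/p} \]

\[ = \sum_{j=1}^{n} |x_j| \Big( \sum_{i=1}^{m} |A_{ij}|^p \Big)^{1/p} \le \sum_{j=1}^{n} |x_j| \max_{1 \le j \le n} \Big( \sum_{i=1}^{m} |A_{ij}|^p \Big)^{1/p} = \max_{1 \le j \le n} \Big( \sum_{i=1}^{m} |A_{ij}|^p \Big)^{1/p}. \]

This shows that the LHS of Equation 6.7 cannot exceed the RHS. Now, if the
maximum on the RHS of Equation 6.7 occurs at $j = j_0$, let x be the $j_0$th column of
the identity matrix. Then $\|x\|_1 = 1$ and

\[ \|Ax\|_p = \Big( \sum_{i=1}^{m} |A_{i j_0}|^p \Big)^{1/p} = \max_{1 \le j \le n} \Big( \sum_{i=1}^{m} |A_{ij}|^p \Big)^{1/p}. \]

Hence, Equation 6.7 holds.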


Equation  6.8  is  proved  in  Reference  7. 


Let $1 < p < \infty$, $\|x\|_p = 1$ and let $p^*$ be the conjugate exponent to p. Note that
$p(p^* - 1) = p^*$. Then,

\[ \|Ax\|_\infty = \max_{1 \le i \le m} \Big| \sum_{j=1}^{n} A_{ij} x_j \Big| \le \max_{1 \le i \le m} \sum_{j=1}^{n} |A_{ij}| |x_j| \le \|x\|_p \max_{1 \le i \le m} \Big( \sum_{j=1}^{n} |A_{ij}|^{p^*} \Big)^{1/p^*}, \]

where the second inequality follows from Hölder's inequality (Reference 4).
Thus, the LHS of Equation 6.9 cannot exceed the RHS. Now, if the maximum on
the RHS of Equation 6.9 occurs at $i = i_0$, let $x_j = \alpha A_{i_0 j} |A_{i_0 j}|^{p^* - 2}$ if $A_{i_0 j} \ne 0$
and $x_j = 0$ if $A_{i_0 j} = 0$ ($1 \le j \le n$), where
$\alpha = \big( \sum_{j=1}^{n} |A_{i_0 j}|^{p^*} \big)^{-1/p}$. Since $p(p^* - 1) = p^*$, the choice
of $\alpha$ gives $\|x\|_p = 1$ and

\[ \|A\|_{\infty p} \ge \Big| \sum_{j=1}^{n} A_{i_0 j} x_j \Big| = \alpha \sum_{j=1}^{n} |A_{i_0 j}|^{p^*} = \Big( \sum_{j=1}^{n} |A_{i_0 j}|^{p^*} \Big)^{1/p^*}. \]

This establishes Equation 6.9. ////
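The closed forms in Proposition 6.2 are easy to evaluate directly. The sketch below is illustrative only: the function names and the test matrix are assumptions, not taken from the report. It implements Equations 6.5, 6.7, 6.8, and 6.9 and then rebuilds the extremal vectors used in the proofs of Equations 6.7 and 6.9 to confirm that the stated values are attained.

```python
def vec_norm(x, p):
    """p-norm of a vector; p = float('inf') gives the max norm."""
    if p == float("inf"):
        return max(abs(t) for t in x)
    return sum(abs(t) ** p for t in x) ** (1.0 / p)

def matvec(A, x):
    return [sum(a * t for a, t in zip(row, x)) for row in A]

def norm_inf_1(A):
    """Equation 6.5: largest entry in absolute value."""
    return max(abs(a) for row in A for a in row)

def norm_p_1(A, p):
    """Equation 6.7: largest column p-norm (1 <= p < inf)."""
    return max(vec_norm(col, p) for col in zip(*A))

def norm_inf_inf(A):
    """Equation 6.8: largest absolute row sum."""
    return max(sum(abs(a) for a in row) for row in A)

def norm_inf_p(A, p):
    """Equation 6.9: largest row p*-norm, with 1/p + 1/p* = 1 (1 < p < inf)."""
    p_star = p / (p - 1.0)
    return max(vec_norm(row, p_star) for row in A)

# Arbitrary illustrative matrix and exponent.
A = [[1.0, -2.0, 3.0], [4.0, 0.0, -1.0]]
p = 3.0
p_star = p / (p - 1.0)

# Extremal vector from the proof of Equation 6.9.
row = max(A, key=lambda r: vec_norm(r, p_star))
alpha = sum(abs(a) ** p_star for a in row) ** (-1.0 / p)
x_star = [alpha * a * abs(a) ** (p_star - 2.0) if a != 0 else 0.0 for a in row]
attained = vec_norm(matvec(A, x_star), float("inf"))

# Extremal vector from the proof of Equation 6.7: a column of the identity.
cols = list(zip(*A))
j0 = max(range(len(cols)), key=lambda j: vec_norm(cols[j], p))
e = [1.0 if j == j0 else 0.0 for j in range(len(cols))]
col_attained = vec_norm(matvec(A, e), p)

print(norm_inf_p(A, p), attained, norm_p_1(A, p), col_attained)
```

The (2, 2)-norm of Remark 6.1 is not covered here, since it requires a singular value computation rather than a closed-form sum.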


Remark 6.1. If $\sigma_{\max}(A)$ and $\sigma_{\min}(A)$ denote, respectively, the largest and
the smallest singular values of the matrix A, then

\[ \|A\|_{22} = \sigma_{\max}(A), \tag{6.10a} \]

\[ \|A^{-1}\|_{22} = 1/\sigma_{\min}(A) \quad \text{if A is invertible.} \tag{6.10b} \]

See References 7 and 8.


Applying Inequality 6.4a repeatedly to Equation 6.2, one obtains the following
upper bound for $\|F'(x)\|_{p1}$ ($1 \le p, q \le \infty$, $L \ge 2$), in which q enters only as an
intermediate index:

\[ \|F'(x)\|_{p1} \le \|W_L\|_{pq} \, \|D_{L-1}(F_{L-2}(x))\|_{q\infty} \, \|W_{L-1}\|_{\infty 1} \prod_{l=L-2}^{1} \big[ \|D_l(F_{l-1}(x))\|_{1\infty} \, \|W_l\|_{\infty 1} \big]. \tag{6.11} \]

Note that all of the norms appearing in Inequality 6.11 can be computed using
the formulas in Proposition 6.2 whenever $(p, q) = (\infty, q)$, $1 \le q \le \infty$, or
$(p, q) = (p, 1)$, $1 \le p \le \infty$. They also can be computed when $(p, q) = (2, 2)$. In
particular, if $W^l \equiv \max_{i,j} |w^l_{ij}|$ ($1 \le l \le L$) and $[F_{l-1}(x)]_j$ denotes the $j$th
component of $F_{l-1}(x)$ ($1 \le j \le m_{l-1}$, $1 \le l \le L-1$), then when $(p, q) = (\infty, 1)$,
Equations 6.5 and 6.6 give

\[ \|F'(x)\|_{\infty 1} \le \prod_{l=1}^{L} W^l \, \prod_{l=1}^{L-1} \sum_{i=1}^{m_l} \Big| S'\Big( \sum_{j=1}^{m_{l-1}} w^l_{ij} [F_{l-1}(x)]_j + \beta^l_i \Big) \Big|. \tag{6.12} \]

Inequality 6.12 is interesting, since it shows that, qualitatively speaking, the
sensitivity at a point (x, F(x)) will be small if the weights are small. In
particular, if the derivative of the activation function S is bounded, say
$|S'(t)| \le \eta$ for all $t \in R$, then

\[ \|F'(x)\|_{\infty 1} \le \eta^{L-1} \prod_{l=1}^{L} W^l \cdot \prod_{l=1}^{L-1} m_l \qquad (x \in R^{m_0}). \]

Moreover, if $S'(t) \to 0$ as $t \to \pm\infty$, then Inequality 6.12 suggests that the
sensitivity (i.e., the norm of the Jacobian matrix) can be made small at the
points of interpolation by choosing $|\beta^l_i|$ large ($1 \le i \le m_l$, $1 \le l \le L-1$, $L \ge 2$).
Whether one can simultaneously choose the biases $|\beta^l_i|$ large, keep the
weights small, and interpolate through a set of points $\Omega$ is a topic for further
research.
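To make the qualitative statement concrete, the following sketch evaluates both sides of Inequality 6.12 for a net with L = 2. It is an illustration under stated assumptions: S = tanh (so the derivative bound can be taken as 1), and the weights, biases, and input are arbitrary values chosen for the example. The sensitivity is the (∞, 1)-norm of F'(x) = W_2 D_1(x) W_1, computed with Equation 6.5 as the largest entry in absolute value.

```python
import math

# Arbitrary illustrative net with m0 = 2, m1 = 3, m2 = 2 and S = tanh.
W1 = [[0.5, -0.3], [0.1, 0.8], [-0.7, 0.2]]
b1 = [0.1, -0.2, 0.05]
W2 = [[0.4, -0.6, 0.3], [0.9, 0.1, -0.2]]
x = [0.3, -0.4]

def max_abs(M):
    """(inf, 1)-norm of a matrix by Equation 6.5."""
    return max(abs(a) for row in M for a in row)

# Hidden-layer pre-activations and the diagonal of D_1(x).
u = [sum(w * xj for w, xj in zip(row, x)) + b for row, b in zip(W1, b1)]
d = [1.0 - math.tanh(t) ** 2 for t in u]

# Jacobian F'(x) = W_2 D_1(x) W_1 for L = 2.
J = [[sum(W2[i][k] * d[k] * W1[k][j] for k in range(len(d)))
      for j in range(len(x))] for i in range(len(W2))]

sensitivity = max_abs(J)
# Right-hand side of Inequality 6.12 with L = 2.
bound_612 = max_abs(W1) * max_abs(W2) * sum(abs(t) for t in d)
# Coarser bound using |S'(t)| <= 1 for tanh: eta^(L-1) * W^1 * W^2 * m_1.
bound_eta = 1.0 * max_abs(W1) * max_abs(W2) * len(d)

print(sensitivity, bound_612, bound_eta)
```

Scaling every weight by a common factor less than one scales both bounds down, and moving the biases so that the pre-activations sit in the flat tails of tanh shrinks the sum of derivatives, which mirrors the two observations made in the text.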

As  an  example  of  the  applications  of  the  theory  developed  in  this  section, 
we  can  investigate  the  sensitivity  of  the  transfer  functions  that  are  obtained 
using  the  Interpolation  Techniques  1  and  2  that  were  presented  in  earlier 
sections.  It  will  be  assumed  that  L  =  2. 

When L = 2, Proposition 6.1 gives

\[ F'(x) = W_2 D_1(x) W_1 \qquad (x \in R^{m_0}), \]

from which it follows (by Inequality 6.4a),

\[ \|F'(x)\|_{pq} \le \|W_2\|_{pr} \, \|D_1(x)\|_{rt} \, \|W_1\|_{tq}. \tag{6.13} \]

If $t = \infty$, we can use Equation 6.6 to compute $\|D_1(x)\|_{r\infty}$ for $1 \le r < \infty$. Note that the
elements of the diagonal matrix $D_1(x)$ coincide with the components of the
vector $S'_{m_1}\big([W_1 : \beta_1] \begin{bmatrix} x \\ 1 \end{bmatrix}\big)$, where $S'_{m_1}$ is the map defined by
$S'_{m_1}(z) = [S'(z_1), S'(z_2), \ldots, S'(z_{m_1})]^T$ for every $z = [z_1, z_2, \ldots, z_{m_1}]^T \in R^{m_1}$.
Therefore, it follows from Equation 6.6 that the $(r, \infty)$-norm of the matrix $D_1(x)$
coincides with the r-norm of this vector,

\[ \|D_1(x)\|_{r\infty} = \Big\| S'_{m_1}\Big([W_1 : \beta_1] \begin{bmatrix} x \\ 1 \end{bmatrix}\Big) \Big\|_r \qquad (x \in R^{m_0}). \tag{6.14} \]

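Now, consider Technique 2 for interpolating through
$\Omega = \{(x_i, y_i) \in R^{m_0} \times R^{m_2} : 1 \le i \le k\}$, with $k = m_0 + 1$ and $m_2 \le m_1$. Assume X is
invertible. Recall that

\[ [W_1 : \beta_1] \begin{bmatrix} x_i \\ 1 \end{bmatrix} = z_i := \begin{cases} S^{-1}(\varepsilon) e_i & \text{for } 1 \le i \le \eta \\ \sum_{j=1}^{\eta} S^{-1}(\varepsilon a_{ij}) e_j & \text{for } \eta + 1 \le i \le k. \end{cases} \]

Hence, Equation 6.14 gives

\[ \|D_1(x_i)\|_{r\infty} = \begin{cases} S'(S^{-1}(\varepsilon)) & \text{for } 1 \le i \le \eta \\ \Big( \sum_{j=1}^{\eta} [S'(S^{-1}(\varepsilon a_{ij}))]^r \Big)^{1/r} & \text{for } \eta + 1 \le i \le k. \end{cases} \tag{6.15} \]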


Next, if the components of the vectors $y_i$ are denoted by $y_{ji}$, $1 \le j \le m_2$,
$1 \le i \le \eta$, and $Y \equiv \max_{i,j} |y_{ji}|$, then by Equations 4.4 and 6.5,

\[ \|W_2\|_{\infty 1} = \frac{1}{\varepsilon} Y. \tag{6.16} \]


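From Inequalities 6.4 and Equation 4.3 one obtains

\[ \|W_1\|_{\infty 2} \le \|Z X^{-1}\|_{\infty 2} \le \|Z\|_{\infty 2} \, \|X^{-1}\|_{22}. \tag{6.17} \]

By Equations 6.9 and 4.2, and the definition of Z,

\[ \|Z\|_{\infty 2} = \max_{1 \le s \le \eta} \Big( [S^{-1}(\varepsilon)]^2 + \sum_{j=\eta+1}^{k} [S^{-1}(\varepsilon a_{js})]^2 \Big)^{1/2}. \tag{6.18} \]

Finally, since $\|X^{-1}\|_{22} = \frac{1}{\sigma_{\min}(X)}$ (Reference 8), by combining Equations 6.15,
6.16, and 6.18 and Inequality 6.17 with Inequality 6.13 with $p = \infty$, $q = 2$, $r = 1$,
and $t = \infty$, one obtains the following upper bound for $\|F'(x_i)\|_{\infty 2}$:

\[ \|F'(x_i)\|_{\infty 2} \le \Big[ \frac{1}{\varepsilon} Y \Big] \frac{1}{\sigma_{\min}(X)} \Big\{ \max_{1 \le s \le \eta} \Big( [S^{-1}(\varepsilon)]^2 + \sum_{j=\eta+1}^{k} [S^{-1}(\varepsilon a_{js})]^2 \Big) \Big\}^{1/2} \|D_1(x_i)\|_{1\infty}. \tag{6.19} \]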


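Under certain conditions on the activation function S, one can obtain a
simpler but more conservative upper bound by considering $\bar{a} \equiv \max_{i,j} |a_{ij}|$ and
$\underline{a} \equiv \min_{i,j} |a_{ij}|$. Note that if S is strictly increasing and symmetric about the
origin, then $|S^{-1}(\varepsilon a_{ij})| \le S^{-1}(\varepsilon \bar{a})$ for all i, j. Moreover, if we assume that $S'$ is
strictly decreasing on $(0, \infty)$, then $S'(S^{-1}(\varepsilon a_{ij})) \le S'(S^{-1}(\varepsilon \underline{a}))$ for all i, j. Under
this condition on S and $S'$, it follows that

\[ \|F'(x_i)\|_{\infty 2} \le \Big[ \frac{1}{\varepsilon} Y \Big] \cdot \frac{1}{\sigma_{\min}(X)} \cdot \Big\{ [S^{-1}(\varepsilon)]^2 + (k - \eta) [S^{-1}(\varepsilon \bar{a})]^2 \Big\}^{1/2} \cdot Q_i, \]

where

\[ Q_i = \begin{cases} S'(S^{-1}(\varepsilon)) & \text{for } 1 \le i \le \eta \\ \eta \, S'(S^{-1}(\varepsilon \underline{a})) & \text{for } \eta + 1 \le i \le k. \end{cases} \]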

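If Technique 1 is used to interpolate through a set $\Omega$ with $k = m_1 + 1$
points, then $[W_2 : \beta_2] = Y_\alpha X_\alpha^{-1}([W_1 : \beta_1])$, where the multi-index $\alpha = (1, 2, \ldots, k)$.
Note that $[W_2 : \beta_2]$ is a function of $[W_1 : \beta_1]$. Setting $p = \infty$, $r = 2$, $t = \infty$, and $q = 1$ in
Inequality 6.13 and applying Inequalities 6.4 to $[W_2 : \beta_2]$, one obtains

\[ \|F'(x)\|_{\infty 1} \le \|Y_\alpha\|_{\infty 2} \, \|X_\alpha^{-1}([W_1 : \beta_1])\|_{22} \, \|D_1(x)\|_{2\infty} \, \|W_1\|_{\infty 1}. \]

This inequality can be written more explicitly as

\[ \|F'(x)\|_{\infty 1} \le \|Y_\alpha\|_{\infty 2} \, \frac{1}{\sigma_{\min}(X_\alpha([W_1 : \beta_1]))} \Big\{ \sum_{i=1}^{m_1} \Big[ S'\Big( \sum_{j=1}^{m_0} w^1_{ij} x_j + \beta^1_i \Big) \Big]^2 \Big\}^{1/2} W^1. \tag{6.20} \]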


As we pointed out in Section 5, application of Technique 1 requires a
selection of the first layer of weights $[W_1 : \beta_1]$ before computing $[W_2 : \beta_2]$. We
gave no guideline on how to select $[W_1 : \beta_1]$. Here we suggest that, after an
initial random choice of $[W_1 : \beta_1]$, one proceeds to select new choices for
$[W_1 : \beta_1]$ using iterative minimization procedures to reduce the RHS of
Inequality 6.20 at the points of interpolation $x_i$ ($1 \le i \le k$), thereby reducing
the bound on the sensitivity of the transfer function F with each iteration.
The details of such a sensitivity-minimization algorithm are still under
investigation.


7.  SUMMARY 

Finding the weights of a feedforward layered neural network so that the
resulting transfer function maps a set of inputs to a desired set of outputs was
described as an interpolation problem. It was shown how to define the weights
so that the net interpolates through a set of $m_0 + 1$ points, where $m_0$ is the
number of inputs and the net has one hidden layer.

It was also shown how to select the last layer of weights of a
multilayered net so that the net interpolates through a set of $m_{L-1} + 1$ points,
where $m_{L-1}$ is the number of neurons in the layer preceding the output layer.
These two approaches (Techniques 1 and 2) provide a partial solution to the
interpolation problem posed in Section 2. Moreover, both of the numbers
$m_0 + 1$ and $m_1 + 1$ serve as lower bounds for the interpolation capacity of nets
with one hidden layer, and the number $m_{L-1} + 1$ is a lower bound for the
interpolation capacity of nets with L layers of weights when L > 2.

The Jacobian matrix of the transfer function was computed. Its operator
norm was used as a sensitivity measure of the transfer function to variations
in its input. The induced (p, q) matrix norms were introduced together with
some of their properties in order to obtain computable upper bounds on the
norm of the Jacobian matrix at the points of interpolation. The results suggest
that small weights are required for low sensitivity. It was also suggested that
the freedom that exists in the selection of the first L - 1 layers of weights,
when using Technique 1 for interpolation, can be exploited in order to
minimize the sensitivity of the transfer function at the points of interpolation.
The details of such a minimization algorithm are a topic for further research.

Another problem that is still under investigation is whether the first
L - 1 layers of weights can be selected (as well as the last layer of weights) so
that the net interpolates through more than $m_{L-1} + 1$ points. For example,
when L = 2 and $m_0 > m_1$, the net can interpolate through more than $m_1 + 1$
points; namely, $m_0 + 1$ using Technique 2.